On Finding the Jaccard Center
Abstract
We initiate the study of finding the Jaccard center of a given collection N of sets. For two sets X, Y, the Jaccard index is defined as |X ∩ Y| / |X ∪ Y| and the corresponding distance is 1 − |X ∩ Y| / |X ∪ Y|. The Jaccard center is a set C minimizing the maximum distance to any set of N. We show that the problem is NP-hard to solve exactly, and that it admits a PTAS while no FPTAS can exist unless P = NP. Furthermore, we show that the problem is fixed-parameter tractable in the maximum Hamming norm between the Jaccard center and any input set. Our algorithms are based on a compression technique similar in spirit to coresets for the Euclidean 1-center problem. In addition, we show that, in contrast to the median problem previously studied by Chierichetti et al. (SODA 2010), the continuous version of the Jaccard center problem admits a simple polynomial-time algorithm.
1998 ACM Subject Classification F.2.2 Computations on Discrete Structures
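To make the definitions in the abstract concrete, here is a minimal Python sketch of the Jaccard distance and of an exhaustive search for a Jaccard center. It is an illustration only, not the paper's algorithm: the function names `jaccard_distance` and `brute_force_jaccard_center` are hypothetical, the search takes exponential time in the size of the universe, and it assumes a non-empty collection of sets. It also relies on the observation that candidates can be restricted to subsets of the union of the input sets, since an element that appears in no input set can only enlarge the unions.

```python
from itertools import combinations


def jaccard_distance(x, y):
    """Return 1 - |x ∩ y| / |x ∪ y| for two finite sets."""
    x, y = frozenset(x), frozenset(y)
    union = x | y
    if not union:
        # Assumed convention: two empty sets are at distance 0.
        return 0.0
    return 1.0 - len(x & y) / len(union)


def brute_force_jaccard_center(sets):
    """Try every subset of the union of the inputs (exponential time).

    Returns (center, radius), where radius is the maximum Jaccard
    distance from the center to any input set. Illustrative only.
    """
    universe = sorted(set().union(*sets))
    best, best_radius = None, float("inf")
    for k in range(len(universe) + 1):
        for candidate in combinations(universe, k):
            c = frozenset(candidate)
            radius = max(jaccard_distance(c, s) for s in sets)
            if radius < best_radius:
                best, best_radius = c, radius
    return best, best_radius


if __name__ == "__main__":
    collection = [{1, 2}, {2, 3}, {1, 3}]
    center, radius = brute_force_jaccard_center(collection)
    print(sorted(center), radius)
```

The exhaustive search is practical only for very small universes, which is consistent with the abstract's NP-hardness result; the PTAS and the fixed-parameter algorithm described there are not reproduced here.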
Similar resources
Algorithms for Data Science: Lecture on Finding Similar Items
Finding similar items is a fundamental data mining task. We may want to find whether two documents are similar to detect plagiarism, mirror websites, multiple versions of the same article, etc. Finding similar items is also useful for building recommender systems, where we want to find users with similar buying patterns. In Netflix, two movies can be deemed similar if they are rated highly by t...
Document Clustering using Feature Selection Based on Multiviewpoint and Link Similarity Measure
Clustering is one of the most powerful and widely used techniques in information retrieval. All clustering methods work by finding relationships among data objects. Various similarity measures, such as cosine and Jaccard, are used along with criterion functions to find the similarity between documents. Clustering efficiency and performance is highly dependent on the accuracy of the similarity meas...
The Voluntary Approach to GHG Reduction: A Case Study of BC Hydro
Voluntary programs for environmental protection are increasingly popular with governments, but it is difficult to assess the extent to which such programs change the behaviour of firms. We conduct a hindsight decision analysis of the electricity supply strategy that BC Hydro chose in the late 1990s while it participated in a Canadian government program for voluntary greenhouse gas (GHG) reducti...
Skin lesion segmentation based on preprocessing, thresholding and neural networks
This abstract describes the segmentation system used to participate in the challenge ISIC 2017: Skin Lesion Analysis Towards Melanoma Detection. Several preprocessing techniques have been tested for three color representations (RGB, YCbCr and HSV) of 392 images. Results have been used to choose the better preprocessing for each channel. In each case a neural network is trained to predict the Ja...
HyperMinHash: Jaccard index sketching in LogLog space
In this extended abstract, we describe and analyse a streaming probabilistic sketch, HYPERMINHASH, to estimate the Jaccard index (or Jaccard similarity coefficient) over two sets A and B. HyperMinHash can be thought of as a compression of standard log n-space MinHash by building off of a HyperLogLog count-distinct sketch. For a multiplicative approximation error 1+ε on a Jaccard index t, given a ...